Conversation
This PR modifies all `INSERT` logic so that virtual (a.k.a. generated) MySQL columns are never inserted into, which otherwise breaks the ferrying process. See also #338.
Hey @plisandro! There were a bunch more things to tackle here; everything should be in place now. Feel free to review and test, as well as sign the CLA mentioned in the checks for the contribution 👍
This is much appreciated, thank you! I've been testing it today and your PR seems to work well. The CLA is now signed as well 😄
Hey @plisandro, I'm not super familiar with the CLA stuff, but the error says:
leave a comment and let's see. Also, I think you may have committed under your original account, so you might need to leave a comment/sign under it as well 🤔
I have signed the CLA!
Done, and apologies for the confusion. I'm in the process of merging accounts now, and this was actually the last contribution left with the old one 🤦
@plisandro no problem!
```go
// Evaluates whether a TableSchema column is generated, by name.
func (t *TableSchema) IsColumnNameGenerated(name string) bool {
	for _, col := range t.Columns {
```
Are we worried about iterating through columns every time? I know there can't be that many, so it's probably not an issue, but we do it in a few places in this file.
I tested this locally, and it has no discernible runtime impact compared to a (previous) version which filters columns beforehand.
As noted in #400 (comment), the problem with doing any pre-processing on columns is that they won't later align with DML and binlog updates, so the logic to handle these becomes a mess.
Sure, but is it worth it? See my comment on row_batch.go.
Note that "a mess" includes being a tad slower 😄 Processing row updates with generated columns missing from table definitions means that you effectively need to keep track of those columns, and their indices, to filter them out on every update - which also involves re-aligning row data.
Filtering generated columns when SQL is constructed is actually simpler/cheaper. Either way, I tested this in a setup with tables with 50+ columns and couldn't measure any significant difference.
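To make the "filter at SQL construction" idea concrete, here is a minimal sketch. The `Column`, `TableSchema`, and `buildInsert` names are simplified stand-ins for illustration, not Ghostferry's actual definitions: generated columns are skipped while the statement is built, so the row data never has to be pre-filtered or re-aligned.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, simplified stand-ins for the real schema types.
type Column struct {
	Name        string
	IsGenerated bool
}

type TableSchema struct {
	Columns []Column
}

// buildInsert constructs an INSERT that names only non-generated
// columns, dropping the corresponding row values as it goes.
func buildInsert(table string, schema *TableSchema, row []interface{}) (string, []interface{}) {
	names := make([]string, 0, len(schema.Columns))
	args := make([]interface{}, 0, len(row))
	for i, col := range schema.Columns {
		if col.IsGenerated {
			continue // never INSERT into VIRTUAL/STORED columns
		}
		names = append(names, col.Name)
		args = append(args, row[i])
	}
	placeholders := strings.TrimSuffix(strings.Repeat("?,", len(names)), ",")
	query := fmt.Sprintf("INSERT INTO %s (%s) VALUES (%s)",
		table, strings.Join(names, ","), placeholders)
	return query, args
}

func main() {
	schema := &TableSchema{Columns: []Column{
		{Name: "id"}, {Name: "price"}, {Name: "price_with_tax", IsGenerated: true},
	}}
	q, args := buildInsert("products", schema, []interface{}{1, 100, 121})
	fmt.Println(q)    // the generated column is absent from the statement
	fmt.Println(args) // and its value is absent from the bind args
}
```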
```go
flattened[rowIdx*rowSize+colIdx] = col
flattened := make([]interface{}, 0, len(e.values)*len(e.columns))
for _, row := range e.values {
```
This is now going through all rows, all columns, then all columns again. #rows * #cols^2. Or am I reading it wrong?
Asking because this will add up when you have 86 columns in a large table, and we're doing string comparisons, no?
The original change followed TableSchema as it exists today, where ignored and compressed columns are indexed by name.
Didn't seem to have any relevant performance hit on my test setup, but if this is a consideration, we could use a hashmap instead. Note it'd have to be lazily-initialized, as TableSchema is initialized ad-hoc all over ghostferry.
ref: #400 & @proton-lisandro-pin
The back and forth is taking a bit of time, so let's speed this up: I've cherry-picked your commits so the attribution is there. The CLA was signed on the original PR; if you could sign the one here as well, it would be 👍
Handle MySQL Generated Columns (STORED and VIRTUAL) in Data Replication
This PR adds support for MySQL generated columns (both `VIRTUAL` and `STORED`) to Ghostferry, enabling proper handling of computed columns during selective data replication.

Problem Statement
MySQL 8.0.23 introduced significant changes to how generated columns are handled in binary log ROW events.
Without special handling, Ghostferry would fail or produce incorrect results when replicating tables with generated columns, as it would attempt to insert values into columns that cannot be modified or would have incorrect column positions.
Solution
This PR implements a comprehensive solution with four key components:
Row Expansion - Detects when MySQL omits virtual columns from binlog events and re-inserts nil sentinels to maintain consistent full-schema column indexing throughout the pipeline.
Insert Value Filtering - Filters out generated column values before constructing `INSERT` statements, allowing only modifiable columns to be inserted while using proper column metadata for value escaping.

Unsigned Integer Normalization - Fixed the order of operations: row expansion happens before unsigned integer normalization, ensuring consistent full-schema column indexing throughout.
Verification with Generated Columns - Includes all columns (including generated) in fingerprint queries to detect divergence when computed values differ between source and target databases.
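The row-expansion component above can be sketched as follows. This is a minimal illustration under the assumption that the omitted binlog values are exactly those of the generated columns; `Column` and `expandRow` are hypothetical names, not Ghostferry's actual API.

```go
package main

import "fmt"

// Hypothetical, simplified stand-in for the real column type.
type Column struct {
	Name        string
	IsGenerated bool
}

// expandRow re-inserts nil sentinels at the positions of generated
// columns that MySQL omitted from the binlog row image, so downstream
// code can keep using full-schema column indices.
func expandRow(columns []Column, row []interface{}) []interface{} {
	expanded := make([]interface{}, 0, len(columns))
	next := 0
	for _, col := range columns {
		if col.IsGenerated {
			expanded = append(expanded, nil) // sentinel for the omitted value
			continue
		}
		expanded = append(expanded, row[next])
		next++
	}
	return expanded
}

func main() {
	cols := []Column{
		{Name: "id"}, {Name: "total", IsGenerated: true}, {Name: "qty"},
	}
	// The binlog row carries only the non-generated values (id, qty).
	fmt.Println(expandRow(cols, []interface{}{7, 3}))
}
```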
Changes
Core Implementation
- `INSERT` event handling to filter out generated columns, and improved binlog event processing for MySQL 8.0.23+ compatibility
- `IsColumnGenerated()` - Identifies virtual and stored columns
- `NonGeneratedColumnNames()` - Returns only insertable columns
- `FilterGeneratedColumnsOnRowData()` - Removes generated column values from rows

Test Coverage
- `VIRTUAL` and `STORED` columns

Edge Cases Handled
- `VIRTUAL` and `STORED` columns in the same table

Testing
Related Issue
Closes #338